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Abstract 

The discovery of events in time series can have important im- 
plications, such as identifying microlensing events in astro- 
nomical surveys, or changes in a patient's electrocardiogram. 
Current methods for identifying events require a sliding win- 
dow of a fixed size, which is not ideal for all applications 
and could overlook important events. In this work, we de- 
velop probability models for calculating the significance of 
an arbitrary-sized sliding window and use these probabilities 
to find areas of significance. Because a brute force search 
of all sliding windows and all window sizes would be com- 
putationally intractable, we introduce a method for quickly 
approximating the results. We apply our method to over 
100,000 astronomical time series from the MACHO survey, 
in which 56 different sections of the sky are considered, each 
with one or more known events. Our method was able to 
recover 100% of these events in the top 1% of the results, es- 
sentially pruning 99% of the data. Interestingly, our method 
was able to identify events that do not pass traditional event 
discovery procedures. 

1 Introduction 

Event discovery in time series data is the focus of many 
modern temporal data mining methods [H [TTJ [13] . An 
event is characterized by an interval of measurements 
that differs significantly from some underlying base- 
line. The difficulty in identifying these events is that 
one must distinguish between events that could have 
occurred by chance and events that are statistically sig- 
nificant. In this paper, we present a method that is 
noise independent and determines the significance of an 
interval. 

We focus our tests in astronomy, for which discov- 
eries could be found by identifying events in time series 
data. For example, microlensing events occur when an 
object passes in front of a light source and acts as a grav- 
itational lens, thus increasing the magnitude of light be- 
ing observed. Figure [T] is an example of such an event. 
Each light curve (time series data recorded by astro- 
nomical observations) has different noise characteristics, 
making it difficult to form a general noise model for all 
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time series in the setjj Furthermore, there are millions 
of observed light curves in modern astronomical surveys, 
many with uninteresting fluctuations or trends, thus any 
effective method must be able to distinguish between 
these variations and statistically significant events such 
as microlensing, flares and others. Examples of other 
applications for which event detection is useful include 
searching for events in stock market data [5] [28], ex- 
amining CPU usage to identify anomalies, and finding 
irregularities in medical data [231 I26| . 
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Figure 1: Example: A microlensing event in the MA- 
CHO data set, with ID 104.20121.1692. The X-axis 
represents time in seconds; the Y-axis represents mag- 
nitude. 2 

Searching for events in time series requires first 
searching for overdensities, intervals of time series with 
a large deviation from the baseline [TH], and then 
determining the statistical significance of the region. A 
naive approach would be to explore all possible intervals 
in order to identify the region with the largest deviation 

i The reasons for varying noise characteristics in astronomical 
observations are plentiful: weather (clouds, wind, etc) create 
trends, the temperature can change the CCD characteristics, 
different stresses on the mechanical structure depending on the 
direction of the telescope can stress the optics differently and 
consequently make a difference in how the light is bent, and other 
possible effects of the environment. 

2 The magnitude of an observation is related to the logarithm 
of the intensity of the light being observed for each star. 



from the baseline. More efficient methods for identifying 
these regions have been explored in [19] and [18]. 

After determining regions of the time series that 
deviate from the baseline, one must determine if these 
regions are statistically significant, which requires mod- 
eling the noise of the time series. Modern methods cre- 
ate such models by performing randomization testing, in 
which a time series is randomly reshuffled several times 
and the original interval is compared to the most anoma- 
lous interval found in each shuffle [IB] . The downfall of 
performing this randomization testing is that it requires 
significant computation, and needs to be performed for 
each new time series. This technique is intractable for 
domains in which we have thousands or even millions of 
time scries, such as light curves in astronomy. 

Our approach falls under the broader category of 
scan statistics [HIE], which claims that by considering 
sliding windows, or intervals of a time series, one can 
determine statistical significance if the underlying noise 
can be understood. To remove the need to model the 
noise for each time series, we begin by first converting 
the time series to rank-space, a discrete series of N 
points representing the rank of each value in a real- 
valued time series, where I is the lowest value and N 
is the highest. This creates a uniform distribution of 
points across the Y-axis, independent of the underlying 
noise. This allows for a probability model to be formed 
that does not require a model of the noise in the time 
series (this probability model is described in Section 
3.3 1. Because the model is applicable to any time series 
of length N that has been converted to rank-space, it has 
the added benefit that it allows an analyst to compare 
the significance of events across different time series. 

Given the probability model described, each interval 
has a p- value, which represents the likelihood of the 
region occurring by chance. The method described in 
this paper considers all possible intervals of a time series. 
In other words, a variable window size is used in the 
analysis, and a p- value can be assigned to each window 
size. To make this search tractable, a well-known 
optimization technique |21] is applied that approximates 
the solutions, by finding the minimum p- value for all 
possible intervals. 

In applying our method, we ran our method on 
the MACHO data set pQ. Out of 110,568 time series, 
our method was able to recover all known microlcnsing 
and blue star events [5] [TJ]. Furthermore, many of 
the events found in the top results would not have 
passed traditional tests, such as a x 2 test- Results of 
the MACHO analysis are explained further in Section 
|5.1| In addition, Section [5T2] details the analysis of 2000 
synthetic time series, in which all events were found, 
including zero false positives. 



Finally, we compare the proposed approach to HOT 
SAX, an anomaly detection algorithm [13]. In doing so, 
we outline the essential differences between an anomaly 
detection algorithm (i.e., HOT SAX) and an event 
detection algorithm. 

In the following sections, we begin by describing the 
related work in scan statistics and anomaly detection in 
Section [2] We then describe our method for forming a 
probability model to assess the statistical significance 
of an interval in Section [3] and how one can identify 
events based on these probability models. In Section [4] 
we describe how we can apply these methods to large 
datasets by approximating results. Finally, Section [5] 
reports our results from an application to an existing 
astronomical survey, MACHO, and a study of synthetic 
time series. 

2 Related Work 

Perhaps the work most closely related to event detection 
in time series is scan statistics, whose goal is to deter- 
mine the probabilities and statistics of spatial subsets. 
Scan statistics are applied in most cases to temporal 
data, but in some cases are generalized to any num- 
ber of d dimensions. The goal of scan statistics is to 
discover statistically significant densities of points that 
deviate from some underlying baseline (see [9] for a de- 
tailed review). Scan statistics have been used to refute 
decisions made based on false identification of seemingly 
significant events. Examples of such events include the 
grounding of all F-14s by the U.S. Navy after a cluster of 
crashes, and when a judge fined a Tennessee State Pen- 
itentiary Institution after an unusual number of deaths 
occurred at the facility [5]. Scan statistics were used 
in both cases to show that the cluster in question was 
likely to have occurred by chance, and thus was not 
statistically significant. 

To see how they are applied to time series data, 
consider a time series with Z points, each independently 
drawn from the uniform distribution of time intervals 
on [0,1) [5]. The scan statistic S w is the largest 
number of points to be found in any interval of length 
w, thus S w is the maximum cluster with window 
size w |17j . The determination of the distribution of 
the statistic P(k\Z,w) is fundamental to this analysis, 
where P(k\Z, w) represents the probability of observing 
k points with the window w. Much work has been 
done to find approximate results for this statistic, such 
as the use of binomial probabilities [37], and Poisson 
distributions [5]. Generally, these approximations work 
well for small values of w and k, with the varying 
levels of complexity and accuracy. It is important to 
note that in many applications, a rough approximation 
is sufficient to find significance. Using this statistic, 



one can determine if an interval of the time series is 
not consistent with the noise, and thus is statistically 
significant. In this work, we describe a new statistic 
for assessing the statistical significance of any arbitrary 
window by forming a new probability distribution, 
motivated by work in scan statistics, that will scale to 
large values of w. Our method has an analytic method 
for finding the exact probability distribution. Finally, 
by forming the distribution for all window sizes, one can 
analyze significance for any arbitrary window size. 

Existing methods in scan statistics currently do not 
address the problem of large data sets (e.g., millions 
or billions of time series). In many applications, one 
requires an event detection algorithm that can be per- 
formed quickly on many time series. In addition, be- 
cause much of the data for these applications has noise 
that is difficult to understand and model, it is critical 
to develop a method that is independent of the noise 
characteristics. The current methods in scan statistics 
are generally based on particular noise models, such as 
Poisson distributions |Fj and binomial probabilities [S7J . 
Both of these characteristics exist in astronomy, and 
thus our goal in this paper is to address both of these 
issues. Due to the intractability of current scan statis- 
tics methods, we cannot compare our method. Thus, 
we are compelled to compare to anomaly detection due 
to its speed, efficiency and that it does not require first 
modeling the noise. 

Anomaly detection in time series is closely related 
to the problem of searching for events in time series. 
In [T3], time series discords are used to determine the 
interval most likely to be an anomaly. Keogh et al 
present a method for determining these discords for 
a fixed-size sliding window. At its core, the method 
considers all possible intervals of the time series for a 
given size (a parameter provided by the domain expert), 
and determines the subset with the largest distance from 
its nearest non-self match (a distinction made to avoid 
trivial matches). Requiring the analyst to provide only 
one parameter is a significant improvement over past 
methods which require several unintuitive parameters 
|14j . Defining a window size in advance is realistic when 
the expert has knowledge about the size of the event, 
or when the time series is periodic and the event is 
likely to occur during one period. As an example, this 
is the case when analyzing an electrocardiogram time 
series [T3] ■ On the other hand, when an expert does 
not know the characteristics of the event or when the 
event could occur at different intervals, it is preferable 
to examine any arbitrary window size. Such is the case 
in astronomy, where one may be searching for any novel 
discovery represented by possible significant changes in 
light patterns (e.g., microlensing) that each may have a 



different time interval. 

It is important to understand the distinction be- 
tween anomaly detection and event detection. When 
detecting anomalies, such as in 13J, the anomalies are 
discords from the time series (or set of time series). 
This is fundamentally different from the goal of this pa- 
per, which is to find subintervals that are statistically 
significant from the underlying noise. In other words, 
anomaly detection finds those subintervals that differ 
most from the rest of the ensemble, whereas event de- 
tection searches for those that significantly differ from 
the noise. 

Methods of identifying events in astronomical data 
exist in the literature. Traditionally for fast identifica- 
tion of variables, a x 2 test of a straight line has been 
used, but overlooks events of short duration. Other 
methods involve fitting event shapes to the light curves, 
generally using a least-squared fit [TO]. This can be 
thought of as creating synthetic light curves with the 
characteristics of known events, and testing these hy- 
potheses. The fitting of events is an accurate method 
for event discovery, yet it is slow and quite expensive. 
What makes our method particularly valuable is that 
it can find events that escape the \ 2 test and is less 
expensive than event-fitting tests. 

While scan statistics build a sound framework for 
determining the significance of events, there are signifi- 
cant challenges that must be overcome in order to deal 
with large amounts of high-dimensional data. Our work 
addresses these issues by first performing a transforma- 
tion of the data to a uniform representation and second, 
developing a probability distribution for assessing the 
significance of intervals of a time series. Furthermore, 
although anomaly detection is related to event detec- 
tion, it is fundamentally different and not particularly 
well suited for our application. Our empirical results in 



Section 5.1 make the distinction between anomaly and 
event detection clear. 

3 Assessing the Statistical Significance of an 
Interval 

The following work differs from the work in scan statis- 
tics in three key ways. First, we do not consider the 
temporal distribution of individual events, but instead 
bin the events in equidistant time intervals. In other 
words, a time series with Z points distributed on [0, 1) 
as defined for scan statistics in Section |2] is transformed 
into a time series by grouping the points into N bins, 
with each bin b representing a constant interval of time. 
Each bin becomes a value Vb in the time series, where Vb 
is the number of points for the bin b. Second, we are not 
concerned in particular with the largest subset S w , but 
with any interval with statistical significance. Third, 



we consider any arbitrary window size w, instead of us- 
ing a fixed value. Forming a distribution for all window 
sizes will allow an analyst to identify the exact interval 
in question. More importantly, this step introduces a 
method for the comparison of p- values for any arbitrary 
window size. In this section, we present a method for 
simplifying our problem by converting a time series into 
rank-space, and then two methods for creating a proba- 
bility distribution of possible window sums for assessing 
the significance of an interval. 

3.1 Formal Definition. We begin by defining a time 
series T = [vi , V2 , ■ ■ ■ , Vt , . . . , vn] of length N where vt 
is the value of a time series T at time t. It is assumed 
that the length of each time interval [t, t + 1] is equal 
for all 1 < t < N. An interval S C T is defined 
by a starting point s and a window size w. Formally, 
S = [vs, ■ ■ ■ , Ws+w-i] where 1 < s < s+w— 1 < N. 

3.2 Rank-Space Transformation. In order to re- 
move the need to model the noise for each individual 
time series, one can consider a time series Tr that has 
been transformed into rank-space. To this end, we rank 
each point in a time series T from 1 to the number of to- 
tal points N, where the lowest value is assigned a value 
of 1 and the largest is assigned a value of N. This will 
allow us to consider a time series with a uniform distri- 
bution of noise, no matter what the noise distribution 
was before the transformation. 

By examining the sums of the values inside any 
arbitrary sliding window of Tr, we can determine its p- 
value (the probability that the sum occurred by chance). 
How to calculate the p-value is the focus of Sections 
|3.3| and |3.4| Consider a starting point s and a window 
size w, and let Q(s,w) be the sum of the ranked 
values inside that window, where 1 < s < N and 
1 < s + w — 1 < N. Thus, we define the sum Q(s, w) — 
r s + r s+ i + • • • + r s+tu _i where r s , . . .,r s+w -i € T R . 
Our method is based on the assumption that the sums 
in the outer tails of the probability distribution of all 
possible sums are significant, as they deviate drastically 
from the underlying baseline. 

Note that with rank-space one loses the depth 
of the time series. In other words, the amount by 
which an event deviates from the baseline is no longer 
distinguishable after the transformation, due to the 
uniform spacing of the points. For example, a time 
series with a significant event that deviates from the 
baseline with a signal to noise ratio of 5 is no different 
than an event with a signal to noise ratio of 20 in rank- 
space. This is beneficial in many cases in which only a 
few points deviate by large amounts from the baseline, 
but are not necessarily significant events. 



A distinct probability distribution exists for each 
pair of (w,N), because each sum is dependent only on 
the number of values being added together (w), and 
their bounds, 1 < fj < N. Therefore, in order to assess 
the statistical significance of a particular sum Q{s, w) of 
Tr, we must know the distribution of (w,N). We first 
describe an analytic method to find this distribution, 
and due to its computational complexity, a Monte Carlo 
method for approximating the same distribution. 

3.3 Determining the Probability Distribution: 
Analytic Method. We wish to find the probability 
that any sum <j>, obtained by the performing the rank- 
space transformation described above, would appear in 
uniformly random noise with a window size w and a 
maximum value N. Thus, our goal is to produce an 
exact probability distribution function for a given <p, w 
and N. 

To obtain our probability curve, we can count the 
number of ways any cj) can be formed with distinct 
values from l,...,N inclusive, and divide by ( w ) (all 
combinations of w distinct values, which represents all 
ways to form <j> with w values). 

To find the number of ways a single sum <f> can 
be formed, we consider the combinatorial problem of 
finding the number of partitions of cj> with w distinct 
parts, each part between 1 and N, inclusive. In 
other words, we wish to count all possible solutions to 
vi+v-2+- ■ -+v w = <f) where < v\ < vi < . . . < v w < N. 

In order to solve this, we transform the problem 
to match a well-known problem involving a specific 
application of the g-binomial coefficients This 
requires that we slightly modify the problem. The 
inequality above is the same as the following: First, 
subtract 1 from the smallest part, 2 from the second, 
etc. to get < V\ — 1 < v 2 — 2 < ... < v w — w < N — w. 
We have subtracted a total of 1+2+. . .+w = w(w+l)/2, 
so we are now looking for the number of partitions of 
4> — w(w + l)/2 with at most w parts, and with largest 
part at most N — w. 

Following the notation of 0, we define p(n, m, k) to 
be the number of partitions of k into at most m parts, 
each part at most n. This value is the coefficient of q k 
in the generating function, 



(3.1) 



where 



(3.2) 



G(n, m; q) = VVn, m, k) q k 



k>0 



{I - q n+rn\M _ q n+ m -l\ . . . M _ q m+l\ 

G[n,m;q) = — — Y r 

(1 - q n )(l -q n !) • • • (1 - q) 

A basic identity of the coefficients provides the 



recurrence on page 34 of [3]: 

(3.3) p(n, to, k) — p(n — 1, to, k — to) + p(n, m — 1, k) 
where the base case follows from the identity 



(3.4) 



p(n, 0, k) = p(0, to, k) 



1 ifn = m= fc = 
otherwise 



Applying this to our problem, we are looking for distinct 
partitions of k = <\> — w(w + l)/2 with at most to = w 
parts and largest part at most n — N — w. Thus, 
p(N — w, w, (f> — w(w + l)/2) is the number of ways to 
create the sum with w unique parts, each value from 
1 to TV. The probability of obtaining sum cf> will equal: 



(3.5) P{frw,N) 



p{N — w,w,(j) — w{w + l)/2) 



Thus, to build a distribution for (w, N), we find 
P(<f>; w, N) for all 0. 

3.4 Determining the Probability Distribution: 
Monte Carlo Method. Although the analytic results 
are preferable and provide exact probabilities for all 
possible sums, the analytic method is deeply recursive 
and prohibitively expensive. Memoization can be used 
to improve performance, by creating a hash for pruning 
branches of the recursion tree, but the hash still remains 
too large in memory to be practical in all cases. In order 
to perform the memoization method for creating the 
analytic distribution, one must keep a three-dimensional 
structure in memory to keep track of the recurrence 
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of the size w r , 

where w max is the largest window size we wish to search 
for. For example, for w max = 50 and N — 500, we 
require 2.2GB of memory, which may be too large for 
some systems. 

Thus an alternative and less memory demanding 
method is to perform random sampling on all possible 
sums. To perform the Monte Carlo method, we repeat- 
edly choose w unique random numbers from 1 to N, sum 
them, and then count the frequency with which each 
sum appears. Dividing each of the frequency counts by 
the total number of samples gives an approximate prob- 
ability distribution curve. 

Because the tails of the distribution represent the 
windows with the lowest p-values, those counts must be 
determined as accurately as possible. We seek statisti- 
cally significant events, thus we define a threshold a for 
which we would like to consider all events with p- values 
lower than a. Our goal in the accuracy of the prob- 
ability distribution is to keep the p-values around the 



threshold a as accurate as possible. In other words, the 
accuracy of the p- value associated with the event is not 
imperative; it is ensuring that the p- values of events do 
not fluctuate around the threshold a. In a Monte Carlo 
run, we consider to be the frequency for each sum 4>. 
The expected accuracy of the frequency of <fi is given by 

(3.6) e~-^ 



where e is the error |22j . In order to ensure accuracy 
of e for a threshold a, we must obtain a minimum of A 
samples for cj>, given by 

(3.7) A-^-^ 



3.6 



because we know that = \ from Equation 

Figure [2] shows the probability distributions for 
different values of (w, N) calculated using the analytical 
and the Monte Carlo approach. It is important to note 
that e and a are statistically motivated and must be 
defined. In this example, our confidence is a = 10~ 4 
and accuracy of 10% (e = .1). Thus, we must perform 
1, 000, 000 random samples. The error in the tails of the 
distribution seen in Figure [2] is less than 10%, and thus 
consistent with theory. 

Not all time series are of equal length N, and thus 
we cannot use the same probability distributions for ev- 
ery time series. Although one could create a probabil- 
ity distribution for every possible N, this is not prac- 
tical. In order to solve this problem, one can split 
the time series into equal parts of length N s , where 



N s < N and N s > w n 



where 



is the largest 



possible window size. Each subset iV s is then ana- 
lyzed as a single time series. To ensure that the al- 
gorithm does not miss any possible events where any 
two intervals meet, one should overlap the intervals by 
Wmax- More formally, given an interval for a fixed 
N, [vi, i>j+i, . . . , Uj+jv-i]) the subsequent search inter- 
val should be [vi+N-w max >•••■> u »+2iv-tu maa( -i]' Further- 
more, this method may be used to analyze time series of 
large lengths, where it is too computationally complex 
to compute a full probability distribution for the length 
of the time series. In order to ensure a correct modeling 
of the noise of N, a large _/V s must be chosen. This is 
sufficient, such that as N grows large, the probability 
distributions become increasingly similar, and thus do 
not change the significance of the p- value. 

These distributions need only be calculated once. 
As a benefit of having a single model based on w, N, 
one can store the distributions to be recalled for later 
analyses. Because all time series are first converted 
to rank-space and thus have a uniform distribution of 
noise, we can use the same noise distribution for all time 
series that are of the same length. 
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Figure 2: Probability distribution for different values 
of the window size w = 3, 10, 20, and total number of 
points N — 50; Monte Carlo probability distributions 
using 1,000,000 samples. 



4 Computationally Tractable Search 

Given a probability distribution for each possible win- 
dow size w, 1 < w < N, and for a given N, we wish to 
find areas of significance. A naive approach to identi- 
fying events would be to apply a brute force technique. 
We compare each sum Q(s,w) for all s, w to the proba- 
bility distribution defined by the method in Section |3.3| 
or Section 13.41 for w and N. The search examines the 
p- value associated with the sum Q(s,w), and ranks the 
regions by p- value, lowest to highest. The complexity 
of the brute force search is 0(N 2 ). For very large data 
sets, this is unacceptable and a quicker approximation 
is needed. In Section [4~T| an analysis of the probability 
surface plot gathered from analyzing all possible (s,w) 
pairs is shown to be quite revealing, and motivates our 
use of well-known optimization methods for approxima- 
tion of results. In addition, it is important to note that 
we refer to event regions rather than specific (s, w) pairs, 
because similar pairs of (s, w) may represent the same 
event . The algorithm outlined in Section |4.2| presents a 
method for combining similar (or overlapping) windows 
and return representative (s, w) values for each of the 
event intervals. 

4.1 Examining the Probability Surface Plot. A 

two-dimensional, s x w, surface plot of the probabil- 
ities shows clearly where significant events occur in 
this space. For the time series shown in Figure [3] 
Figure [4] depicts the probabilities for each pair (s, w), 
where darker regions corresponds to lower p- values. 
After examination, one notices three regions of low- 
probability points. By locating the minimum of each 
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Figure 3: Synthetic time series with two events. 
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Figure 4: Contour plot of probabilities for all (s, w) pairs 
from analysis of Figure [3] 

low-probability region, one can pinpoint the exact time 
and duration of the event in the time series, by identi- 
fying the starting point s and window size w. In order 
to find these regions algorithmically and efficiently, the 
problem can be reduced to the problem of minimization 
(a special case of the class of optimization algorithms). 

There are several methods for performing the nec- 
essary minimization [15] (2Ql [22]. The success of the 
method detailed in this paper does not depend signifi- 
cantly on the particular minimization technique. Pow- 
ell's method [2T] was chosen due to its efficiency and 
the lack of need to compute derivatives. Our goal is to 
find all minima that are significant because we search 
for multiple events. Section |4~2] addresses this problem. 

4.2 Random Restarts and Combining Results. 

Powell's method begins by selecting a search vector (a 



vector of points on the surface plot), performing a lin- 
ear minimization on the vector, and then selecting a 
new search vector from this point. These steps are re- 
peated until a satisfactory minimum is found. Random 
restarts are introduced to alleviate dependence on the 
original search vector. This helps avoid finding spurious 
solutions, which often appear when using approxima- 
tion methods in the presence of noise. For each random 
restart, a pseudo-random number generator is used to 
seed a different vector on the surface plot. When the 
window size of the desired event to be found is small 
(i.e., when w is small, the event would appear as a 
quick spike in the time series) , any minimization method 
is likely to find spurious solutions. These cases would 
benefit from more random restarts to ensure that the 
significant minima arc found. The analyst must define 
the number of random restarts required (suggested val- 
ues for this parameter are discussed in Section 5.2 1. 



Because many random restarts will identify similar 
instances of the same solution (see Figure [5]), we in- 
troduce an agglomerative clustering method to find the 
best suited final (s, w) pairs for an event region. Con- 
sider two results, (s\,wi) and (s2,u>2)- The goal is to 
determine if they represent the same solution. Thus, we 
consider the overlap of the two results, and if that over- 
lap is within a threshold 8, the two pairs are considered 
to be the same result and are in the same cluster Cj. 
Thus, two results overlap if 



(4.8) 
where 

(4.9) 



9 < (Rtotal 
< (Rtotal 



(Rbegi 
(Rbogi 



+ Rcnd))/Wi 
+ R e nd))/w 2 



Rtotal = max(si + wi, s 2 + w 2 ) 

Rbcgin = |Sl — S2I 

Rend = I (Si + Wi) - (S 2 +W 2 )| 



min(si, s 2 ) 



To build the clusters, we iterate through each result and 
compare the current result (s',u/) with each element 
of each cluster Cj. If no cluster exists in which each 
clement overlaps with the current result, we create a 
new cluster Cfc containing only (s',w r ). Finally, in each 
cluster, the clement in that cluster corresponding to the 
sum with the lowest probability should be used as the 
representative p- value for that cluster. In experiments, 
we found that the results are fairly insensitive to the 
chosen value for 8 (the default value used in our 
experiments was 9 = .75). 

Note that random restarts are beneficial when deal- 
ing with time series that could have multiple events. In 
these situations, the random restarts will allow the min- 
imization to find multiple intervals with low p- values, 
each of which has the potential to be an event. In the 
following section, we explore how to distinguish between 



likely events and spurious solutions when dealing with 
the final (s, w) pairs determined from the clustering al- 
gorithm. 
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Figure 5: Contour plot of the p- values of the analysis 
of Figure [3J each X represents a minimization using 
Powell's method. 



4.3 Identifying Events. After we have found all 
potentially significant events, we rank order them by 
significance for further examination. There are two 
aspects to consider when analyzing the event regions. 
First, one should consider all intervals that are minima 
in the surface plot of p- values for a time series. Each 
of these minima are included in the ranking of events 
for all time series being considered. By identifying all 
minima, we are able to find multiple areas of interest in 
a single time series. Such is the case in Figure [3] which 
is an example of time series with multiple events with 
differing p- values. 

Second, in some cases, where a trend exists or the 
signal-to-noise ratio is low, simply ranking by the p- 
value of each true minima may result in overlooking 
some events. For example, Figure [6] is a time series 
with an upward trend. As one can see from the surface 
plot in Figure [7J the most significant window resides 
at the end of the trend, but there is clearly another 
minimum with significance at s — 70 and w = 13. 
Thus, it is also important to consider the frequency 
with which the event regions were identified during the 
random restarts. This will identify the intervals that are 
clear minima on the probability surface plot, and thus 
could be significant to the time series. The frequency 
of the event regions can be considered by the analyst in 
addition to its p-value. 

Although our method will identify significant events 
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Figure 6: Synthetic time series with trend. 
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Figure 7: Contour plot of the p- values of the analysis of 
Figure [6| 

on a trend like the one in Figure [6j the event regions 
associated with the end of the trend will still appear as 
significant. Thus, in the case of a data set with many 
trends, it may be preferable to first consider filtering 
the time series by detrending them. The most basic 
method would be to subtract the best-fit line. One 
can also consider multiple detrending algorithms for 
specific fields, such as detrending for astronomy [THl IM] 
or detrending for medical applications [6]. 

5 Results 

Two major analyses were performed with the method 
described in this paper. First, we analyze a subset of 
the MACHO survey that was chosen because there are 
known events in the data, and also because it repre- 
sents a large enough size to demonstrate the method's 
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Figure 8: Blue star in MACHO survey (MACHO ID: 
2.5628.4423). Event was found with p- value = 5.008 x 

io- 142 . 

efficiency. Our results show the accuracy of the method 
in our astronomical application. Second, we create syn- 
thetic time series to find empirical results for one of the 
method's parameters, and in addition, show further ev- 
idence of the accuracy of the method. 

5.1 MACHO. The MACHO project spanned several 
years and observed millions of stars in order to detect a 
few rare events. It is a well-known and well-explored 
survey PQ. Our goal is to find at least two known 
types of events in this data: microlensing events [5] 
and evidence of blue stars [15]. Other event types 
exist, but these two examples are well researched and 
cited, and thus serve the purpose of verifying our 
results. Microlensing events are characterized by a 
single significant deviation from the baseline (see Figure 
[I] for an example). Blue stars are characterized by a 
similar shape, and also have a similar duration to a 
microlensing event (see Figure [8] for an example) |12j . 

Our analysis was performed on 110,568 time series 
from 56 tiles (from the red band) , each tile representing 
a portion of the sky. In total, there were 28 known 
microlensing events and 5 known blue stars in the 
subset. Each time series had a length between 1000 and 
1812, with an average length of 1386. Before performing 
our analysis, some basic detrending was performed by 
subtracting a least-squares linear fit for each time series. 
All time series were analyzed with the same probability 
distribution for N — 1000, where any time series larger 
would use the overlapping method described in Section 
I3~4l 

A ranking of the set of possible events (i.e., intervals 
of time series) was done by considering the event with 
the lowest p- value for each interval. Table [l] summarizes 



the analysis of the MACHO data set. There was no dif- 
ference between the results obtained from the analytic 
and Monte Carlo methods. From the results, we ob- 
serve that all known events were found in the top 1.1% 
of the ranks. In order to show that this is a desirable 
result, one must examine the events found to be more 
significant than the lowest ranked known event (rank 
1217). We examined all 1217 significant light curves 
and found that each either contains a significant event, 
or is periodic/pseudo-periodic and thus they appear as 
significant deviations from the baseline. Examples of 
an unidentified and pseudo-periodic events are found in 
Figures [9] and [10] respectively. 
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Figure 9: MACHO ID 80.6468.77. Periodic (or pseudo- 
periodic) event not identified as microlensing or as a 
blue star, p-value is 1.688 x 10 -104 , with a rank of 55. 
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of events in astronomical data, such as performing a 
\ 2 test. For example, the microlensing example in 
Figure 11 has \ 2 = 1-13, but has a p-value significance 
of 4.273 x 10~ 26 and a rank in the above analysis of 
689. Moreover, it is important to note that 371 new 
events with % 2 < 3 were discovered in these top ranks, 
such as the example in Figure [12] We are currently 
examining these events to determine the nature of those 
phenomena. Each event must be examined carefully, by 
comparing colors and desirably spectra information. A 
follow-up paper to appear in an astronomical journal 
will address those cases 123 . 
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Figure 11: MACHO ID 119.20610.5200. Microlensing 
event with p- value of 4.273 x 10~ 26 , but with X 2 = 1-13. 
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Figure 12: MACHO ID 113.18809.1561. Unidentified 
event with p-value is 3.184 x 10 -17 , a rank of 1199, and 
Figure 10: MACHO ID 118.18278.261. Event not x 2 = 0.82. 
identified as microlensing or as a blue star, p- value is 

8.167 x 10~ 112 , with a rank of 27. In addition, we compared our results to HOT SAX 

to make the distinction between event detection and 
In Section [2] we discuss past work for the discovery anomaly detection. The analysis was run on the same 



Table 1: Results of MACHO Analysis, analytic. 



Event Type 


Median Rank 


p- value of Median 


Rank of Last 


p- value of Last 


Microlensing 
Blue Star 


159 
114 


6.577 x 1CT 50 
4.296 x 10" 61 


1217 
324 


1.102 x 1CT 21 
8.872 x 1CT 42 



data set of 110, 568 detrended time series, and was done 
using the brute force method of HOT SAX. The dis- 
tance of the discord was calculated using the Euclidean 
distance, and the results were then ranked from high- 
est to lowest Euclidean distance (the largest discord be- 
tween nearest neighbors). In the top 1217 results, only 3 
known microlensing events were discovered, with many 
other time series of little significance appearing as false 
positives. The results are summarized in Table [2] It 
is clear that finding significance of subintervals is not 
the goal of HOT SAX, which is more attuned to peri- 
odic series with less noise, where it performs quite well 
[13] . As discussed in Section [2] the distinction between 
anomaly detection and event detection is key, in that 
they have different goals. 

Table 2: Results of MACHO Analysis using HOT SAX. 



Event Type 


Median Rank 


Lowest Rank 


Microlensing 
Blue Star 


34, 233 
21,691 


91,779 
52, 866 



5.2 Synthetic Time Series. In order to examine 
how the number of random restarts required by our 
method affects the results, we performed an analysis 
on synthetic time scries. The data set consists of 2000 
synthetic time series, 323 of which are time series with 
varying sizes of single events, and 50 with two events. 
The remaining 1627 time scries are gaussian noise, with 
a standard deviation of 5.0 and a mean of 0.0. 

For those time series with events, the events were 
created using the following function f(t): 

(5.10) f(t) = he~ { ™* )2 +e 

where e is the error (or noise), h is the height of the 
event (Y-axis), S is the time t at the highest point of the 
event (X-axis), and modifies the length of the event. 
Our events ranged from 9 = 1. 5,. ..,10. 5 by increments 
of 0.5, h = 20, . . . , 100 by increments of 5, and e was 
a random variable drawn from a Gaussian distribution 
with a standard deviation a = 5.0 and a mean fi = 0. 
Thus, the signal to noise ratio of these time series ranged 
from 2 to 20. 

We ran our method on all 2000 synthetic time series, 
and the events were ranked by their p- value. All 423 



events (323 from single event time series, and 2 events 
for each of the 50 time series with two events) were 
in the top 423 ranks, without a single false positive in 
those ranks. All events were found with a p- value of 
6.018 x 10~ 9 or less, whereas all of the most significant 
intervals in the time series without events had p-values 
higher than 2.891 x 10~ 7 . 

HOT SAX performed almost as well on this data. 
In the top ranks, only 7 false positives were reported. 
It is interesting to note that the lower ranked events 
were those that occurred in time series with two events. 
This is due to the fact that the algorithm will consider 
the two events as nearest neighbors, and subsequently 
each will be scored lower. As stated, the comparison 
of such events is ideal for situations in which a time 
series has periodic events and one is attempting to 
discover anomalies within these events, not discovering 
the events themselves. In addition, a second experiment 
was conducted with HOT SAX by adding a single point 
to each of the time series (at a random value between 
1 and N), with a value of —5h. Adding such a value 
is consistent with many domains, such as astronomy, 
where outliers in time series are quite common. After 
conducting the experiment, the results were no longer 
useful, with no correlation in the top ranks with time 
series that contained events. Our method was able 
to reproduce nearly identical results to the original 
synthetic time series experiment. 

The final aspect of this experiment was to consider 
different values for the number of random restarts 
needed for our method. The above experiment was 
performed with 30 random restarts. In addition, the 
analysis was done with 5, 10, 20, 30, 50 random restarts. 
Table [3] summarizes the results of this experiment. It 
is important to note that significant events in this 
experiment are considered those that are ranked above 
a p-value threshold of 10~ 8 , the p-valuc for which all 
pure-noise time series fall below. The column One- 
Event TS represents the number of events found (one 
per time series) in the set of 323 time series with only 
one event. The next column, Two-Event TS, represents 
the number of total events found from the set of 50 
time series with two events (thus, we hope to find 100 
events) . It is clear that increasing the number of random 
restarts also increases the possibility of finding a second 
significant event, but that one begins to experience 
diminishing returns after about 30 random restarts. 



Table 3: Results of synthetic time series analysis for 
determining the required number of random restarts 



Random Restarts 


One-Event TS 


Two-Event TS 


5 


93.5% 


58% 


10 


99.1% 


73% 


20 


100% 


96% 


30 


100% 


100% 


50 


100% 


100% 



For the sake of completeness, the experiment was run 
with all 2000 time series for each number of random 
restarts (5,10,20,30,50). Although no false positives 
were found in any of the runs, there were some false 
negatives (events that were not discovered) when the 
number of random restarts was low. It is important to 
note that those events not discovered when the random 
restarts were below 30 are those with events consisting 
of very few points (where 8 < 3 from f(t) above). This is 
because the event region associated with the surface plot 



of the p- values (explored in Section 4.1l is quite small 



and will often be missed by the optimization method. 

5.3 Timing Results. In this section, we present 
the timing of applying our method using the learned 
probability distributions. The analysis of a single time 
series includes transformation into rank-space (Section 
3.2 1 and then an approximate search for results using 
the optimization algorithm described in Section [4] All 
56 tiles were analyzed in parallel on a cluster, each 
on a separate Xeon E5410 2.3Ghz processor with a 
global memory of 32,768 GB. The computation required 
approximately 4 minutes 23 seconds to complete in 
total, with an average of approximately 0.13 seconds 
per time series. Because the probability distributions 
are computed only once, the method scales linearly 
with the number of time series, each of which computes 
in constant time (due to the approximation method 
described in Section [4]) . A discussion of the complexity 
of computing the initial probability distribution can be 
found in Section |3l 

The source for this project is available at 
the Time Series Center website, located at 
http: / '/timemachine. iic. harvard, edu/. 

6 Conclusion 

In this work, we developed a method for identifying 
statistically significant events in time series without the 
need to model the noise, and without the restriction 
of a fixed-size sliding window. The current literature in 
event detection describes methods requiring one or both 
of these restrictions. 



By first converting a time series to rank-space, 
it allows us to build noise distribution models that 
generalize to any time series. All possible intervals of the 
new time series can then be considered, and the sums 
of the values inside the window can be compared to a 
probability distribution of possible sums to determine 
statistical significance. Scan statistics motivated our 
work in developing probability distributions for sliding 
windows. In addition, we showed two methods for 
developing these distributions, one using an analytic 
method that is often computationally intractable, and 
another using a quicker Monte Carlo method. Finally, 
when identifying events, we showed that minimization 
techniques can be used to reduce the complexity of the 
brute force method by approximating our results. 

Our method has three parameters. First, the 
number of random restarts must be defined, but the 
sensitivity of this parameter is small, as our analysis 
showed that one begins to see diminishing returns after 
a reasonable number of random restarts. Second, if 
one uses the more computationally tractable method 
for computing the probability distribution (as outlined 



in Section 3.4 1, the analyst must define e (the acceptable 
error). Third, in Section 4.2 the value 9 was used to 
define what was considered the same event region. If 
one chooses a 9 that is too small, different events may 
be clustered together and one may not find all events 
in a time series. On the other hand, if 9 is too large, 
the results may report the same event regions several 
times. In our evaluation, the results of the method 
are fairly insensitive to this value, as it can range from 
0.05 < 9 < 0.75 with little change in the results. 

We successfully performed our method on the MA- 
CHO data set. In a set of data full of interesting events, 
such as microlensing and blue stars, we were able to 
identify the most likely candidates to be something of 
interest to an astronomer. We then compared these to 
known events, and showed that our method was able 
to recover all known events. Our method identified 
several events that generally fail traditional tests, and 
furthermore, identified several events that are currently 
unidentified. Next, we performed the same analysis on 
synthetic time series of varying sizes, lengths and with 
different noise characteristics. The algorithm performed 
as expected (returning no false positives), and empirical 
results were presented for the only parameter required 
by the method, the number of random restarts for the 
minimization technique. Finally, to clarify the distinc- 
tion between event and anomaly detection, our method 
was compared to a leading anomaly detection algorithm, 
and our method was shown to find stronger results in 
our motivating domain and when performing the anal- 
ysis on synthetic time series. 



This paper presented results from a subset of MA- 
CHO, which was large enough to understand the effi- 
cacy of our method. In the near future, we will apply 
our method to other large astronomical surveys such as 
MACHO, TAOS and OGLE. Other upcoming surveys, 
such as Pan-STARRS and LSST, could also benefit from 
using this method to analyze the millions of stars that 
require analysis. 
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